2025-05-12-12-04

Understanding Stragglers in Large Model Training Using What-if Analysis

Abstract

arXiv:2505.05713v1 Announce Type: new Abstract: Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always simply caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?

摘要

大语言模型(LLM)训练是当前最具挑战性的分布式计算任务之一，通常需要数千个GPU并频繁进行跨机器同步。这种工作负载模式使其容易受到落后节点(straggler)影响，少数速度较慢的工作节点即可导致整个训练停滞。字节跳动研究发现，落后节点并非总是由硬件故障简单引起，而是可能源于多种复杂因素。本研究基于字节跳动LLM训练集群的五个月追踪数据，旨在对LLM训练中的落后节点问题进行系统性分析。核心研究方法是通过假设分析模拟无落后节点的理想场景，并与实际情况进行对比。我们运用该方法探究以下问题：(1)落后节点影响训练任务的频率及其对作业性能的影响程度；(2)落后节点是否呈现时间或空间上的规律性；(3)导致落后节点的潜在根本原因有哪些？
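The what-if idea can be illustrated with a minimal sketch. It assumes that each synchronous step finishes only when its slowest worker does, and that the straggler-free scenario can be approximated by giving every worker the per-step median duration. The trace layout, the function name, and the use of the median are illustrative assumptions, not the simulator described in the paper.

```python
from statistics import median


def what_if_slowdown(step_times):
    """Ratio of actual job time to a simulated straggler-free job time.

    step_times: list of steps, each a list of per-worker durations in seconds
    (hypothetical trace format).
    """
    # Synchronous training: a step ends when the slowest worker finishes.
    actual = sum(max(workers) for workers in step_times)
    # Assumed straggler-free step: every worker runs at the median pace.
    ideal = sum(median(workers) for workers in step_times)
    return actual / ideal


trace = [
    [1.0, 1.0, 1.0, 1.0],  # healthy step
    [1.0, 1.0, 1.0, 3.0],  # one slow worker stalls the whole step
]
slowdown = what_if_slowdown(trace)
```

A ratio close to 1 would indicate that stragglers cost the job little; larger values quantify the time lost to slow workers under these assumptions.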

An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact

Abstract

arXiv:2505.05494v1 Announce Type: new Abstract: The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.

摘要

欧盟《反森林砍伐条例》（EUDR）要求企业证明其产品未导致森林砍伐，这催生了对于精确资产级环境影响数据的迫切需求。现有数据库因过度依赖宽泛的财务指标和人工数据收集而缺乏必要细节，制约了法规遵从与环境建模的准确性。本研究提出一种自动化端到端数据提取流程，利用大语言模型（LLMs）构建、清理和验证结构化数据库，特别针对高森林砍伐风险行业。该流程创新性地采用基于指令-角色-零样本思维链（IRZ-CoT）的提示策略提升数据提取精度，并引入检索增强验证（RAV）机制，通过实时网络搜索提高数据可靠性。在美国证券交易委员会EDGAR系统采矿、石油天然气及公用事业领域备案文件的应用表明，相较于传统零样本提示方法，该流程在提取准确性与验证覆盖度方面均有显著提升。本研究推动了自然语言处理技术在法规遵从、企业社会责任（CSR）及环境社会治理（ESG）领域的自动化应用，具有广泛的行业适用性。

HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics

Abstract

arXiv:2505.05602v1 Announce Type: new Abstract: As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.

摘要

随着大语言模型（LLMs）及其他人工智能系统的快速发展，从本质上具有随机性的输出中稳健地评估其能力，并系统量化这些评估中的不确定性变得愈发重要。此外，先进的人工智能评估通常具有嵌套的层次结构，展现出高度复杂性，且测试最先进人工智能系统的成本高昂。为应对这些挑战，我们提出了HiBayES——一个可推广的层次贝叶斯建模框架，专为人工智能评估统计而设计。HiBayES支持在经典问答基准测试和高级智能体评估中进行稳健推断，尤其适用于低数据场景（如每次评估少于20个数据点）。该框架基于广义线性模型（GLMs）、贝叶斯数据分析和形式化模型比较，能够提供严格的不确定性量化和稳健的参数估计。本文全面介绍了HiBayES，包括示例演示、与传统统计方法的对比，以及实现多层次贝叶斯GLMs的实用指南。此外，我们还提供了开箱即用的HiBayES软件包[4]（测试版）。
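To see why Bayesian estimates are attractive in low-data evaluations, here is a minimal sketch of the simplest related model: a conjugate Beta-Binomial update for a pass rate. This is not HiBayES itself, which the abstract describes as built on multilevel GLMs; the function name and the uniform Beta(1, 1) prior are assumptions chosen for illustration.

```python
def beta_binomial_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return (mean, std) of the Beta posterior over a success rate.

    Assumes independent pass/fail trials and a Beta(prior_a, prior_b) prior.
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance ** 0.5


# Same observed pass rate (80%), very different amounts of evidence.
small = beta_binomial_posterior(8, 10)
large = beta_binomial_posterior(80, 100)
```

With only 10 data points the posterior standard deviation is roughly three times larger than with 100, which is the kind of uncertainty a bare accuracy number hides.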

Leveraging Large Language Models for enzymatic reaction prediction and characterization

Abstract

arXiv:2505.05616v1 Announce Type: new Abstract: Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, e.g., through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: Enzyme Commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning via LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.

摘要

预测酶促反应对于生物催化、代谢工程和药物发现等应用至关重要，但这仍是一项复杂且资源密集的任务。大型语言模型（LLMs）近期在多个科学领域展现出显著成效，例如通过其知识泛化能力、复杂结构推理能力以及上下文学习策略的运用。本研究系统评估了LLMs（特别是Llama-3.1系列的8B和70B模型）在三个核心生化任务中的表现：酶学委员会编号预测、正向合成及逆合成。我们比较了单任务与多任务学习策略，并采用基于LoRA适配器的参数高效微调方法。此外，通过不同数据规模下的性能评估，探究了模型在低数据环境中的适应性。结果表明，经微调的LLMs能够捕捉生化知识，且多任务学习通过共享酶学信息提升了正向与逆合成预测性能。同时，我们也发现了关键局限性，例如在层级式EC分类体系中的挑战，这为LLM驱动的生化建模指明了进一步改进的方向。
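The parameter-efficient fine-tuning mentioned above can be sketched in a few lines. LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A. The pure-Python helpers below are a toy illustration with hypothetical names, not the training code used in the study.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Parameters in the adapter: B is d_out x r and A is r x d_in."""
    return r * (d_out + d_in)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
b = [[1.0], [2.0]]             # d_out x r, trainable
a = [[0.5, 0.0]]               # r x d_in, trainable
w_eff = lora_effective_weight(w, b, a, alpha=2.0)
```

For a 4096 x 4096 layer with rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 adapter parameters against 16,777,216 in the full matrix, which is why the approach suits low-data regimes.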

scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction

Abstract

arXiv:2505.05612v1 Announce Type: new Abstract: Drug resistance presents a major challenge in cancer therapy. Single cell profiling offers insights into cellular heterogeneity, yet the application of large-scale foundation models for predicting drug response in single cell data remains underexplored. To address this, we developed scDrugMap, an integrated framework featuring both a Python command-line interface and a web server for drug response prediction. scDrugMap evaluates a wide range of foundation models, including eight single-cell models and two large language models, using a curated dataset of over 326,000 cells in the primary collection and 18,800 cells in the validation set, spanning 36 datasets and diverse tissue and cancer types. We benchmarked model performance under pooled-data and cross-data evaluation settings, employing both layer freezing and Low-Rank Adaptation (LoRA) fine-tuning strategies. In the pooled-data scenario, scFoundation achieved the best performance, with mean F1 scores of 0.971 (layer freezing) and 0.947 (fine-tuning), outperforming the lowest-performing model by over 50%. In the cross-data setting, UCE excelled post fine-tuning (mean F1: 0.774), while scGPT led in zero-shot learning (mean F1: 0.858). Overall, scDrugMap provides the first large-scale benchmark of foundation models for drug response prediction in single-cell data and serves as a user-friendly, flexible platform for advancing drug discovery and translational research.

摘要

耐药性是癌症治疗中的主要挑战。单细胞分析技术为细胞异质性研究提供了新视角，但基于大规模基础模型预测单细胞数据药物反应的应用仍待探索。为此，我们开发了scDrugMap集成框架，该框架包含Python命令行界面和网页服务器两种药物反应预测工具。scDrugMap系统评估了多种基础模型（包括8个单细胞模型和2个大语言模型），使用经过筛选的326,000余个主数据集细胞和18,800个验证集细胞（涵盖36个数据集及多种组织和癌症类型）进行测试。我们通过数据池化和跨数据集两种评估场景，采用层冻结和低秩自适应（LoRA）微调策略对模型性能进行基准测试。在数据池化场景中，scFoundation表现最佳（平均F1分数：层冻结0.971，微调0.947），性能比最低模型高出50%以上。在跨数据集场景中，UCE在微调后表现最优（平均F1分数0.774），而scGPT在零样本学习中领先（平均F1分数0.858）。scDrugMap首次为单细胞数据药物反应预测提供了大规模基础模型基准测试，并构建了用户友好、灵活的平台以推动药物发现和转化研究。
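For readers unfamiliar with the reported metric, the sketch below computes a binary F1 score and averages it over several datasets. The abstract does not state how the mean is taken, so averaging per dataset is an assumption, as are the function names and the toy labels (1 standing for one response class, 0 for the other).

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(datasets):
    """Average F1 over (y_true, y_pred) pairs, one pair per dataset (assumed)."""
    return sum(f1_score(t, p) for t, p in datasets) / len(datasets)


dataset_a = ([1, 1, 0, 0], [1, 0, 0, 0])  # one positive missed
dataset_b = ([1, 0, 1, 0], [1, 0, 1, 0])  # perfect predictions
score = mean_f1([dataset_a, dataset_b])
```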

DawnPiper: A Memory-scalable Pipeline Parallel Training Framework

Abstract

arXiv:2505.05856v1 Announce Type: new Abstract: Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively support. In this paper, we introduce DawnPiper, a memory-scalable pipeline parallel training framework. First, we develop a DL compilation-based profiling method that transforms the model into a fine-grained computation graph. This refinement gives us a finer granularity of model partitioning and memory optimization while facilitating automatic code generation. Based on observed memory usage characteristics, we derive a performance-optimal theorem for pipeline parallel partitioning that substantially reduces the partition search space. Second, we propose a binary pipeline partitioning algorithm and use a cost-model-based memory optimization approach to efficiently identify a nearly optimal pipeline parallel strategy. DawnPiper achieves up to a 4x and 11x increase in maximum trainable batch size compared to vPipe and PipeDream, respectively, and provides up to a 1.5x performance speedup compared to vPipe.

摘要

流水线并行是大规模模型训练的关键范式。然而，各阶段内存占用的不均衡会导致显著的GPU内存浪费，从而限制流水线并行能有效支持的模型规模。本文提出DawnPiper，一种内存可扩展的流水线并行训练框架。首先，我们开发了一种基于深度学习编译的分析方法，将模型转化为细粒度计算图。这种优化使我们能以更精细的粒度进行模型划分和内存优化，同时便于自动代码生成。基于观察到的内存使用特征，我们推导出流水线并行划分的性能最优定理，大幅缩减了划分搜索空间。其次，我们提出一种二分流水线划分算法，并采用基于成本模型的内存优化方法，高效识别接近最优的流水线并行策略。与vPipe和PipeDream相比，DawnPiper可实现最大可训练批量大小分别提升4倍和11倍，同时相比vPipe提供最高1.5倍的性能加速。
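The memory-imbalance problem can be made concrete with a toy example. The sketch below is not DawnPiper's partitioning algorithm; it swaps in the classic approach of binary-searching the smallest per-stage memory cap under which a greedy contiguous split fits in the available stages. Per-layer memory is reduced to a single integer, which ignores activation memory that differs by stage in real pipelines; the names and numbers are hypothetical.

```python
def stages_needed(layer_mem, cap):
    """Greedily pack consecutive layers into stages holding at most `cap` each."""
    stages, used = 1, 0
    for mem in layer_mem:
        if used + mem > cap:
            stages += 1
            used = mem
        else:
            used += mem
    return stages


def min_peak_stage_memory(layer_mem, num_stages):
    """Smallest achievable maximum stage memory for a contiguous partition."""
    lo, hi = max(layer_mem), sum(layer_mem)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(layer_mem, mid) <= num_stages:
            hi = mid        # cap is feasible, try a tighter one
        else:
            lo = mid + 1    # cap too small, relax it
    return lo


layer_mem = [6, 4, 3, 3, 2, 2]            # hypothetical per-layer memory in GB
naive_peak = max(sum(layer_mem[:3]), sum(layer_mem[3:]))  # equal layer counts
balanced_peak = min_peak_stage_memory(layer_mem, num_stages=2)
```

Splitting by equal layer counts leaves one stage at 13 GB and the other at 7 GB, while the balanced split peaks at 10 GB on both, which is the kind of wastage the abstract refers to.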

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Abstract

arXiv:2505.05758v1 Announce Type: new Abstract: Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 75.0% among 7B-parameter models while keeping the sampling budget below one thousand. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.

摘要

形式化推理与自动定理证明是机器学习中一个具有挑战性的子领域，其目标是让机器使用Lean等形式化语言来证明数学定理。形式化验证系统可以近乎即时地检验形式化证明的正确性，但利用大语言模型（LLM）生成完全正确的形式化证明仍是一项艰巨任务。现有文献通常采用的方法是多次提示LLM（多达数千次），直到生成的证明之一通过验证系统。本研究提出APOLLO（基于LLM与Lean协作的自动证明修复框架），这是一种模块化、模型无关的流程，通过结合Lean编译器的优势与LLM的推理能力，在低采样预算下实现更优的证明生成效果。APOLLO引导全自动化流程：LLM生成定理证明，代理组分析证明、修复语法错误、利用Lean识别证明缺陷、隔离失败子引理、调用自动求解器，并在每个剩余目标上以低Top-K预算调用LLM。修复后的子证明经重组和重新验证，迭代次数受用户控制。在miniF2F基准测试中，我们在7B参数模型上实现了75.0%的最新准确率，同时保持采样次数低于一千次。此外，APOLLO将Goedel-Prover-SFT的准确率提升至65.6%，同时将采样复杂度从25,600次降至数百次。通用模型（o3-mini、o4-mini）的准确率从3-7%跃升至40%以上。研究结果表明，针对LLM输出进行编译器引导的定向修复，能显著提升效率与正确性，这为可扩展的自动定理证明提供了通用范式。
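The generate, verify, and repair cycle described above can be outlined as a small control loop. Everything in this sketch is hypothetical: the `generate`, `verify`, and `repair` callables stand in for the LLM, the Lean checker, and the repair agents, and the toy versions below only manipulate strings. It shows the shape of an iteration bounded by a user-controlled attempt limit, not APOLLO's actual pipeline.

```python
def prove_with_repair(theorem, generate, verify, repair, max_attempts=3):
    """Generate a proof, then alternate verification and repair.

    Returns (proof, attempts_used) on success, or (None, max_attempts).
    """
    proof = generate(theorem)
    for attempt in range(1, max_attempts + 1):
        ok, errors = verify(proof)
        if ok:
            return proof, attempt
        if attempt < max_attempts:
            proof = repair(proof, errors)  # patch only the failing parts
    return None, max_attempts


# Toy stand-ins (assumptions): a proof is a list of tactic names and the
# placeholder "sorry" marks an unproven sub-goal.
def toy_generate(theorem):
    return ["intro", "sorry"]


def toy_verify(proof):
    errors = [i for i, step in enumerate(proof) if step == "sorry"]
    return (not errors, errors)


def toy_repair(proof, errors):
    return ["simp" if i in errors else step for i, step in enumerate(proof)]


result = prove_with_repair("a + 0 = a", toy_generate, toy_verify, toy_repair)
```

The point of the structure is that each retry reuses the checker's feedback instead of sampling a fresh proof from scratch.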

ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Abstract

arXiv:2505.06020v1 Announce Type: new Abstract: Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.

摘要

理解视觉艺术需要跨越文化、历史和风格等多重视角进行推理，而不仅仅是对象识别。尽管当前的多模态大语言模型（MLLMs）在通用图像描述任务中表现良好，但它们往往难以捕捉精细艺术所需的微妙诠释。我们提出ArtRAG——一种无需训练的新型框架，通过结合结构化知识与检索增强生成（RAG）技术来实现多视角艺术品阐释。该框架自动从领域特定文本源构建艺术语境知识图谱（ACKG），将艺术家、流派、主题和历史事件等实体组织成丰富可解释的图结构。在推理阶段，多粒度结构化检索器会选择语义和拓扑相关的子图来引导生成，使MLLMs能产出基于语境、蕴含文化背景的艺术描述。在SemArt和Artpedia数据集上的实验表明，ArtRAG优于多个经过充分训练的基线模型。人类评估进一步证实，ArtRAG生成的阐释具有连贯性、洞察力及文化深度。
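The topological side of subgraph retrieval can be sketched as a bounded breadth-first expansion around seed entities. This is only an illustration under assumptions: the tiny graph, the function name, and the hop limit are hypothetical, and the semantic scoring that ArtRAG's retriever also performs is omitted.

```python
from collections import deque


def k_hop_subgraph(graph, seeds, k):
    """Return the set of nodes within k hops of any seed entity."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


# Toy knowledge graph as an adjacency list (hypothetical contents).
graph = {
    "Water Lilies": ["Monet"],
    "Monet": ["Water Lilies", "Impressionism"],
    "Impressionism": ["Monet", "Renoir"],
    "Renoir": ["Impressionism"],
}
context = k_hop_subgraph(graph, ["Water Lilies"], k=2)
```

The retrieved nodes would then be serialized into the prompt so the model can ground its description in the artist and movement rather than in the pixels alone.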

Free and Fair Hardware: A Pathway to Copyright Infringement-Free Verilog Generation using LLMs

Abstract

arXiv:2505.06096v1 Announce Type: new Abstract: Limitations in Large Language Model (LLM) capabilities for hardware design tasks, such as generating functional Verilog codes, have motivated various fine-tuning optimizations utilizing curated hardware datasets from open-source repositories. However, these datasets remain limited in size and contain minimal checks on licensing for reuse, resulting in potential copyright violations by fine-tuned LLMs. Therefore, we propose an evaluation benchmark to estimate the risk of Verilog-trained LLMs to generate copyright-protected codes. To minimize this risk, we present an open-source Verilog dataset, FreeSet, containing over 220k files, along with the automated dataset curation framework utilized to provide additional guarantees of fair-use Verilog data. We then execute an LLM fine-tuning framework consisting of continual pre-training, resulting in a fine-tuned Llama model for Verilog, FreeV. Our results indicate that FreeV demonstrates the smallest risk of copyright-infringement among prior works, with only a 3% violation rate. Furthermore, experimental results demonstrate improvements in Verilog generation functionality over its baseline model, improving VerilogEval pass@10 rates by over 10%.

摘要

大型语言模型（LLM）在硬件设计任务（如生成功能性Verilog代码）中的能力局限，促使研究者利用开源仓库中的精选硬件数据集进行各种微调优化。然而，这些数据集规模仍然有限，且对重用许可的审查不足，导致微调后的LLM可能存在版权侵权风险。为此，我们提出一个评估基准，用于估算经过Verilog训练的LLM生成受版权保护代码的风险。为最小化这一风险，我们发布了一个开源Verilog数据集FreeSet，包含超过22万份文件，并提供了自动化数据集筛选框架以确保数据的合理使用。随后，我们实施了一个包含持续预训练的LLM微调框架，最终得到针对Verilog微调的Llama模型FreeV。结果表明，FreeV在现有工作中展现出最低的版权侵权风险，违规率仅为3%。此外，实验结果显示其在Verilog生成功能上优于基线模型，将VerilogEval pass@10率提升了10%以上。

Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

Abstract

arXiv:2505.05626v1 Announce Type: cross Abstract: Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first provides insights into how MLLMs internally build visual understanding of image regions and then introduces techniques to amplify this capability. Specifically, we explore techniques designed both to deepen the model's understanding of visual content and to ensure that these visual insights actively guide language generation. We demonstrate the superior multimodal understanding of our resultant model through a detailed upstream analysis quantifying its ability to predict visually-dependent tokens, as well as a 10-point boost on visually challenging tasks.

摘要

实现视觉与语言的深度对齐仍是多模态大语言模型(MLLMs)面临的核心挑战。现有模型往往未能充分利用视觉输入，而是依赖于强大的语言先验。本研究首先揭示了MLLMs如何在内部建立对图像区域的视觉理解，进而提出增强该能力的技术方案。具体而言，我们探索了两种关键技术：一是深化模型对视觉内容的理解能力，二是确保这些视觉认知能主动引导语言生成。通过上游分析量化模型预测视觉相关词汇的能力，以及在视觉挑战性任务上10个百分点的性能提升，我们验证了最终模型具有更优越的多模态理解能力。

Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval

Abstract

arXiv:2505.05666v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.

摘要

检索增强生成（RAG）通过将回答基于外部文档，已成为提高大语言模型（LLM）可靠性和实用性的流行技术。传统RAG系统依赖光学字符识别（OCR）先将扫描文档处理为文本。然而，即使最先进的OCR也可能引入错误，尤其是在质量较差或复杂的文档中。最近的视觉语言方法（如ColPali）提出直接对文档进行视觉嵌入，从而无需OCR。本研究系统比较了基于视觉的RAG系统（ColPali）与更传统的基于OCR的流程（使用Llama 3.2（90B）和Nougat OCR）在不同文档质量下的表现。除了传统的检索准确性指标外，我们还引入了一个语义回答评估基准来评估端到端问答性能。我们的研究结果表明，虽然基于视觉的RAG在其微调过的文档上表现良好，但基于OCR的RAG能更好地泛化到不同质量的未见文档。我们强调了计算效率与语义准确性之间的关键权衡，为RAG实践者在生产环境中选择依赖OCR或基于视觉的文档检索系统提供了实用指导。

Assessing Robustness to Spurious Correlations in Post-Training Language Models

Abstract

arXiv:2505.05704v1 Announce Type: cross Abstract: Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and KTO (Kahneman-Tversky Optimization) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that the models often but not always degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of spurious correlations.

摘要

监督式和基于偏好的微调技术已成为使大语言模型（LLMs）与用户意图及正确性标准对齐的主流方法。然而，现实训练数据中普遍存在的虚假相关性——源于偏差、数据集伪影或其他"捷径"特征——可能损害模型性能或泛化能力。本文系统评估了三种训练后算法（监督微调[SFT]、直接偏好优化[DPO]和卡尼曼-特沃斯基优化[KTO]）在多样化合成任务与虚假相关性条件下的表现。实验任务涵盖数学推理、受限指令遵循及基于文档的问答三大类。我们通过调节虚假相关性强度（10% vs. 90%）并考察"特征模糊性"与"分布狭窄性"两种伪影形式发现：模型在高虚假相关性条件下通常（但非绝对）表现劣化。基于偏好的方法（DPO/KTO）在数学推理任务中展现出相对鲁棒性，而SFT在复杂语境密集型任务中保持更强性能。这些结果表明：不存在适用于所有场景的最优训练后策略，最佳选择取决于目标任务类型及虚假相关性的具体特性。
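One way to picture a controlled degree of spurious correlation is a synthetic dataset in which a shortcut cue agrees with the label for a chosen fraction of examples. The sketch below is an assumed construction for illustration only; the abstract does not specify how the 10% and 90% conditions are built, and the field names are hypothetical.

```python
def make_spurious_dataset(labels, rate):
    """Attach a binary shortcut cue that matches the label for `rate` of examples."""
    n_aligned = round(rate * len(labels))
    data = []
    for i, label in enumerate(labels):
        # The first n_aligned examples carry a cue equal to the label;
        # the rest carry the opposite cue.
        cue = label if i < n_aligned else 1 - label
        data.append({"label": label, "cue": cue})
    return data


def cue_agreement(data):
    """Fraction of examples whose shortcut cue equals the true label."""
    return sum(d["label"] == d["cue"] for d in data) / len(data)


labels = [0, 1] * 5
low = make_spurious_dataset(labels, 0.1)
high = make_spurious_dataset(labels, 0.9)
```

A model trained on the 90% split can reach high training accuracy by reading the cue alone, which is exactly the shortcut behavior a robustness evaluation tries to expose.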

Adaptive Stress Testing Black-Box LLM Planners

Abstract

arXiv:2505.05665v1 Announce Type: cross Abstract: Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. We argue that detecting such failures is necessary, especially in safety-critical scenarios. Existing black-box methods often detect hallucinations by identifying inconsistencies across multiple samples. Many of these approaches typically introduce prompt perturbations like randomizing detail order or generating adversarial inputs, with the intuition that a confident model should produce stable outputs. We first perform a manual case study showing that other forms of perturbations (e.g., adding noise, removing sensor details) cause LLMs to hallucinate in a driving environment. We then propose a novel method for efficiently searching the space of prompt perturbations using Adaptive Stress Testing (AST) with Monte-Carlo Tree Search (MCTS). Our AST formulation enables discovery of scenarios and prompts that cause language models to act with high uncertainty. By generating MCTS prompt perturbation trees across diverse scenarios, we show that offline analyses can be used at runtime to automatically generate prompts that influence model uncertainty, and to inform real-time trust assessments of an LLM.

摘要

大型语言模型（LLMs）近期在规划、控制和预测等决策任务中展现出卓越的跨任务泛化能力，但其产生不安全与不良输出的幻觉倾向会带来风险。我们认为在安全关键场景中，检测此类故障尤为必要。现有黑盒方法通常通过识别多个样本间的不一致性来检测幻觉，这些方法大多采用提示扰动策略（如随机化细节顺序或生成对抗性输入），其依据在于置信度高的模型应输出稳定结果。我们首先通过人工案例研究表明，其他形式的扰动（例如添加噪声、移除传感器细节）会导致LLMs在驾驶环境中产生幻觉。随后提出一种创新方法，利用蒙特卡洛树搜索（MCTS）的自适应压力测试（AST）高效搜索提示扰动空间。我们的AST框架能够发现导致语言模型高不确定性行为的场景与提示。通过在不同场景中生成MCTS提示扰动树，我们证明离线分析可在运行时自动生成影响模型不确定性的提示，并为LLM的实时可信度评估提供依据。
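Tree search over prompt perturbations needs a rule for deciding which perturbation to try next. A common choice in MCTS is the UCB1 score, which adds an exploration bonus to the average reward. The sketch below assumes that rule and uses hypothetical perturbation names and rewards (for instance, measured model uncertainty); the abstract does not say which selection rule the authors use.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_perturbation(children, parent_visits):
    """Pick the child with the highest UCB1 score.

    children maps a perturbation name to (total_reward, visits).
    """
    return max(children, key=lambda name: ucb1(*children[name], parent_visits))


children = {
    "add_noise": (3.0, 5),     # often tried, moderate reward
    "drop_sensor": (1.0, 1),   # rarely tried, high reward so far
}
choice = select_perturbation(children, parent_visits=6)
```

Here the rarely tried perturbation wins because both its average reward and its exploration bonus are higher, so the search budget flows toward prompts that look most likely to destabilize the model.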

Multi-Agent Systems for Robotic Autonomy with LLMs

Abstract

arXiv:2505.05762v1 Announce Type: cross Abstract: Since the advent of Large Language Models (LLMs), research based on such models has attracted sustained academic attention and had significant impact, especially in AI and robotics. In this paper, we propose a multi-agent framework with LLMs to construct an integrated system for robotic task analysis, mechanical design, and path generation. The framework includes three core agents: Task Analyst, Robot Designer, and Reinforcement Learning Designer. Outputs are formatted as multimodal results, such as code files or technical reports, for stronger understandability and usability. To evaluate generalizability comparatively, we conducted experiments with models from both GPT and DeepSeek. Results demonstrate that the proposed system can design feasible robots with control strategies when appropriate task inputs are provided, exhibiting substantial potential for enhancing the efficiency and accessibility of robotic system development in research and industrial applications.

摘要

自大型语言模型（LLMs）问世以来，基于此类模型的各类研究持续保持着重要的学术关注度和影响力，尤其在人工智能与机器人领域。本文提出一种基于LLM的多智能体框架，用于构建机器人任务分析、机械设计与路径生成的集成系统。该框架包含三个核心智能体：任务分析师、机器人设计器和强化学习设计器。输出结果以代码文件或技术报告等多模态形式呈现，以增强可理解性与实用性。为对比评估泛化能力，我们采用GPT和DeepSeek系列模型进行了实验。结果表明，在提供适当任务输入时，本系统能够设计出具备控制策略的可行机器人方案，展现出提升科研与工业应用中机器人系统开发效率及可及性的巨大潜力。

MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design

Abstract

arXiv:2505.05799v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.

摘要

混合专家（MoE）模型因其庞大的参数量与计算需求面临部署挑战。本研究探索MoE模型的量化技术，揭示两个关键发现：1）线性模块表现出不同的量化敏感性；2）专家激活频率的差异导致异构计算特征。基于这些观察，我们提出MxMoE——一个兼顾算法与系统视角的MoE模型混合精度优化框架。MxMoE通过权衡参数敏感性、专家激活动态与硬件资源所定义的设计空间，推导高效的混合精度配置方案。此外，该框架能自动生成优化的混合精度GroupGEMM内核，支持不同精度GEMM的并行执行。实验表明，MxMoE在2.25比特下比GPTQ降低2.4个Wikitext-2困惑度，相比全精度实现最高达3.4倍加速，在5比特权重-激活量化条件下，较均匀量化在同等精度下实现最高29.4%的加速效果。代码已开源：https://github.com/cat538/MxMoE。
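Why blocks differ in how many bits they need can be seen from a toy quantizer. The sketch below applies symmetric uniform quantization at a given bit-width and measures the resulting error; it is a generic illustration with hypothetical weights, not MxMoE's allocation method or its GroupGEMM kernels.

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers and back.

    Requires bits >= 2 so that at least one positive level exists.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)  # nothing to scale
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and quantized values."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


weights = [0.9, -0.45, 0.1, 0.02]  # hypothetical weights of one linear block
errors = {bits: mean_abs_error(weights, bits) for bits in (2, 4, 8)}
```

A mixed-precision scheme would spend more bits on blocks where this error hurts accuracy most and fewer on blocks that tolerate it, subject to what the hardware can execute efficiently.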

What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips

Abstract

arXiv:2505.05794v1 Announce Type: cross Abstract: Large language models (LLMs) are rapidly pushing the limits of contemporary computing hardware. For example, training GPT-3 has been estimated to consume around 1300 MWh of electricity, and projections suggest future models may require city-scale (gigawatt) power budgets. These demands motivate exploration of computing paradigms beyond conventional von Neumann architectures. This review surveys emerging photonic hardware optimized for next-generation generative AI computing. We discuss integrated photonic neural network architectures (e.g., Mach-Zehnder interferometer meshes, lasers, wavelength-multiplexed microring resonators) that perform ultrafast matrix operations. We also examine promising alternative neuromorphic devices, including spiking neural network circuits and hybrid spintronic-photonic synapses, which combine memory and processing. The integration of two-dimensional materials (graphene, TMDCs) into silicon photonic platforms is reviewed for tunable modulators and on-chip synaptic elements. Transformer-based LLM architectures (self-attention and feed-forward layers) are analyzed in this context, identifying strategies and challenges for mapping dynamic matrix multiplications onto these novel hardware substrates. We then dissect the mechanisms of mainstream LLMs, such as ChatGPT, DeepSeek, and LLaMA, highlighting their architectural similarities and differences. We synthesize state-of-the-art components, algorithms, and integration methods, highlighting key advances and open issues in scaling such systems to mega-sized LLM models. We find that photonic computing systems could potentially surpass electronic processors by orders of magnitude in throughput and energy efficiency, but require breakthroughs in memory, especially for long-context windows and long token sequences, and in storage of ultra-large datasets.

摘要

大型语言模型（LLMs）正迅速逼近当代计算硬件的性能极限。例如，训练GPT-3的电力消耗估计约为1300兆瓦时，预测表明未来模型可能需要城市规模（吉瓦级）的电力预算。这些需求促使人们探索超越传统冯·诺伊曼架构的计算范式。本文综述了面向下一代生成式AI计算优化的新兴光子硬件技术。我们讨论了执行超快矩阵运算的集成光子神经网络架构（如马赫-曾德尔干涉仪阵列、激光器、波长复用微环谐振器），并研究了有前景的替代性神经形态器件，包括脉冲神经网络电路和兼具存储与处理功能的混合自旋电子-光子突触。同时回顾了将二维材料（石墨烯、过渡金属二硫属化合物）集成至硅光子平台以实现可调谐调制器与片上突触元件的研究。在此框架下，我们分析了基于Transformer的LLM架构（自注意力层与前馈层），提出了将动态矩阵运算映射到这些新型硬件基底的策略与挑战。随后剖析了ChatGPT、DeepSeek和LLaMA等主流LLM的运行机制，着重比较其架构异同。通过综合前沿组件、算法与集成方法，我们阐明了在向超大规模LLM扩展过程中取得的关键进展与待解决问题。研究发现光子计算系统在吞吐量与能效方面可能超越电子处理器数个数量级，但需在存储器（尤其是长上下文窗口与长令牌序列场景）以及超大规模数据集存储方面实现突破。
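The two figures quoted above use different units: 1300 MWh is an amount of energy, while a gigawatt is a rate of consumption. A short conversion, using only the numbers from the abstract and a hypothetical helper name, puts them on the same scale.

```python
def hours_at_power(energy_mwh, power_gw):
    """Hours needed to consume `energy_mwh` at a constant draw of `power_gw`."""
    power_mw = power_gw * 1000.0  # 1 GW = 1000 MW
    return energy_mwh / power_mw


# Energy estimated for training GPT-3, drawn at a constant 1 GW.
hours = hours_at_power(1300.0, 1.0)
```

At a sustained 1 GW, the estimated GPT-3 training energy would be consumed in about 1.3 hours, which shows how far the projected power budgets exceed that earlier run.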

AgentXploit: End-to-End Redteaming of Black-Box AI Agents

Abstract

arXiv:2505.05849v1 Announce Type: cross Abstract: The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent-based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box fuzzing framework, AgentXploit, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, thereby maximizing the likelihood of uncovering agent weaknesses. We evaluate AgentXploit on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of baseline attacks. Moreover, AgentXploit exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyond benchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.

摘要

大型语言模型（LLMs）强大的规划与推理能力推动了基于代理的系统发展，这些系统能够利用外部工具并与日益复杂的环境交互。然而，这些强大特性也带来了关键安全风险：间接提示注入攻击。这种复杂攻击手段通过操纵上下文信息（而非直接用户提示）来攻陷这些代理的核心——LLM模型。本研究提出通用黑盒模糊测试框架AgentXploit，可自动发现并利用各类LLM代理中的间接提示注入漏洞。该方法首先构建高质量初始种子语料库，随后采用基于蒙特卡洛树搜索（MCTS）的种子选择算法迭代优化输入，从而最大化发现代理弱点的概率。我们在AgentDojo和VWA-adv两个公开基准上评估AgentXploit，其针对基于o3-mini和GPT-4o构建的代理分别达到71%和70%的成功率，较基线攻击性能提升近一倍。此外，AgentXploit在未见任务与内部LLM间展现出强迁移性，对防御机制也表现出良好效果。除基准测试外，我们在真实环境中实施攻击，成功诱导代理导航至包括恶意网站在内的任意URL。

Evolutionary thoughts: integration of large language models and evolutionary algorithms

Abstract

arXiv:2505.05756v1 Announce Type: cross Abstract: Large Language Models (LLMs) have unveiled remarkable capabilities in understanding and generating both natural language and code, but LLM reasoning is prone to hallucination and struggles with complex, novel scenarios, often getting stuck on partial or incorrect solutions. In contrast, the inherent ability of Evolutionary Algorithms (EAs) to explore extensive and complex search spaces makes them particularly effective in scenarios where traditional optimization methodologies may falter. However, EAs must explore a vast search space when applied to complex problems. To address the computational bottleneck of evaluating large populations, which is particularly crucial for complex evolutionary tasks, we introduce a highly efficient evaluation framework. This implementation maintains compatibility with existing primitive definitions, ensuring the generation of valid individuals. Using LLMs, we propose an enhanced evolutionary search strategy that enables a more focused exploration of expansive solution spaces. LLMs facilitate the generation of superior candidate solutions, as evidenced by empirical results demonstrating their efficacy in producing improved outcomes.

摘要

大语言模型（LLMs）在理解和生成自然语言及代码方面展现出卓越能力，但其推理过程易产生幻觉，且在复杂新颖场景中表现欠佳，常陷入局部或错误解。而进化算法（EAs）凭借其探索广阔复杂搜索空间的固有特性，在传统优化方法失效的场景中表现尤为突出。然而，当应用于复杂问题时，进化算法需遍历庞大的搜索空间。为缓解评估大规模种群的计算瓶颈（这对复杂进化任务至关重要），我们提出了一种高效评估框架。该实现保持与现有原始定义的兼容性，确保生成有效个体。基于大语言模型，我们提出一种增强型进化搜索策略，实现对广阔解空间更聚焦的探索。实证结果表明，大语言模型能有效生成更优候选解，其提升解决方案质量的能力得到验证。

Evolutionary ecology of words

Abstract

arXiv:2505.05863v1 Announce Type: cross Abstract: We propose a model for the evolutionary ecology of words as one attempt to extend evolutionary game theory and agent-based models by utilizing the rich linguistic expressions of Large Language Models (LLMs). Our model enables the emergence and evolution of diverse and infinite options for interactions among agents. Within the population, each agent possesses a short word (or phrase) generated by an LLM and moves within a spatial environment. When agents become adjacent, the outcome of their interaction is determined by the LLM based on the relationship between their words, with the loser's word being replaced by the winner's. Word mutations, also based on LLM outputs, may occur. We conducted preliminary experiments assuming that "strong animal species" would survive. The results showed that from an initial population consisting of well-known species, many species emerged both gradually and in a punctuated equilibrium manner. Each trial demonstrated the unique evolution of diverse populations, with one type of large species becoming dominant, such as terrestrial animals, marine life, or extinct species, all ecologically specialized and adapted to diverse extreme habitats. We also conducted a long-term experiment with a large population, demonstrating the emergence and coexistence of diverse species.

摘要

我们提出了一种词汇进化生态学模型，旨在通过利用大型语言模型（LLM）丰富的语言表达能力，扩展演化博弈论和基于智能体的建模方法。该模型能够实现智能体间多样化且无限交互选项的涌现与演化。在群体中，每个智能体拥有一个由LLM生成的短词（或短语），并在空间环境中移动。当智能体相邻时，其交互结果由LLM根据双方词汇的语义关系判定，败者的词汇将被胜者取代。基于LLM输出的词汇变异也可能发生。我们以"强势动物物种将存活"为假设进行了初步实验，结果表明：从由知名物种组成的初始群体出发，许多物种以渐进式和间断平衡式两种模式涌现。每次实验都展现出独特且多样化的种群演化路径，其中一类大型物种（如陆生动物、海洋生物或已灭绝物种）会占据主导地位，这些物种均是在多样化极端栖息环境中经过生态特化与适应的产物。我们还开展了大规模群体的长期实验，证实了多样化物种的涌现与共存现象。

Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2

Abstract

arXiv:2505.05946v1 Announce Type: cross Abstract: This technical report describes an experiment on autoregressive pre-training of the Gemma2 2-billion-parameter large language model (LLM) on 10% of the Lithuanian language component of CulturaX, from the point of view of continual learning. We apply elastic weight consolidation (EWC) to the full set of the model's parameters and investigate language understanding benchmarks, consisting of Arc, Belebele, Gsm8K, Hellaswag, MMLU, TruthfulQA, and Winogrande sets (both in English and Lithuanian versions), and perplexity benchmarks. We empirically demonstrate that EWC regularisation allows us not only to mitigate catastrophic forgetting effects but also that it is potentially beneficial for learning of the new task with LLMs.

摘要

本技术报告从持续学习的角度，描述了使用CulturaX立陶宛语部分10%的数据，对Gemma2 20亿参数大语言模型(LLM)进行自回归预训练的实验。我们采用弹性权重固化(EWC)方法对模型全部参数进行处理，并评估了包括Arc、Belebele、Gsm8K、Hellaswag、MMLU、TruthfulQA和Winogrande测试集（英语和立陶宛语版本）在内的语言理解基准，以及困惑度指标。实验结果表明，EWC正则化不仅能有效缓解灾难性遗忘现象，还可能对大语言模型学习新任务产生积极影响。
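For readers unfamiliar with EWC, the regulariser is commonly written as a Fisher-weighted quadratic penalty that pulls each parameter back toward its value after the earlier task. The sketch below uses that textbook form on plain Python lists. The function name and arguments are illustrative assumptions, and the report's actual training code and hyperparameters are not shown in the abstract.

```python
def ewc_loss(task_loss, params, anchor_params, fisher, lam):
    """Task loss plus the EWC quadratic penalty.

    penalty = (lam / 2) * sum_i fisher[i] * (params[i] - anchor_params[i]) ** 2

    params        -- current parameter values
    anchor_params -- parameter values after the previous task
    fisher        -- diagonal Fisher information estimates (importance weights)
    lam           -- regularisation strength (assumed hyperparameter)
    """
    penalty = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )
    return task_loss + 0.5 * lam * penalty
```

Parameters with a large Fisher value are expensive to move, which is how the penalty protects knowledge from the earlier task while leaving less important weights free to adapt.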

Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Abstract

arXiv:2505.06085v1 Announce Type: cross Abstract: The increasing demand for generative AI services based on Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental building block of LLM computations. We present a detailed characterization of how Grayskull's execution model, grid size, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

摘要

随着大型语言模型（LLMs）作为生成式AI服务的需求日益增长，对优化计算效率和能耗的专用硬件架构的需求也随之增加。本文评估了Tenstorrent Grayskull e75 RISC-V加速器在降低数值精度下对基本线性代数核（LLM计算中的基础操作）的性能表现。我们详细分析了Grayskull的执行模型、网格规模、矩阵维度、数据格式及数值精度对计算效率的影响。此外，我们将Grayskull的性能与支持张量加速的先进架构进行了对比，包括Intel Sapphire Rapids处理器和两款NVIDIA GPU（V100与A100）。尽管NVIDIA GPU在原始性能上占据优势，但Grayskull在功耗与计算吞吐量之间展现出具有竞争力的平衡，其BF16精度下的峰值能效达到1.55 TFLOPs/瓦。

LLMs Outperform Experts on Challenging Biology Benchmarks

Abstract

arXiv:2505.06108v1 Announce Type: cross Abstract: This study systematically evaluates 27 frontier Large Language Models on eight diverse biology benchmarks spanning molecular biology, genetics, cloning, virology, and biosecurity. Models from major AI developers released between November 2022 and April 2025 were assessed through ten independent runs per benchmark. The findings reveal dramatic improvements in biological capabilities. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test over the study period, with the top model now performing twice as well as expert virologists. Several models now match or exceed expert-level performance on other challenging benchmarks, including LAB-Bench CloningScenarios and the biology subsets of GPQA and WMDP. Contrary to expectations, chain-of-thought did not substantially improve performance over zero-shot evaluation, while extended reasoning features in o3-mini and Claude 3.7 Sonnet typically improved performance as predicted by inference scaling. Benchmarks such as PubMedQA and the MMLU and WMDP biology subsets exhibited performance plateaus well below 100%, suggesting benchmark saturation and errors in the underlying benchmark data. The analysis highlights the need for more sophisticated evaluation methodologies as AI systems continue to advance.

摘要

本研究系统评估了27个前沿大语言模型在分子生物学、遗传学、克隆技术、病毒学及生物安全等八个多样化生物学基准测试中的表现。针对2022年11月至2025年4月期间主流AI开发商发布的模型，每个基准测试均进行十次独立运行。研究结果显示模型生物学能力取得显著提升：在研究期间，顶尖模型在病毒学能力测试纯文本子集上的表现提升逾4倍，当前最优模型性能已达病毒学专家水平的2倍。多个模型在LAB-Bench克隆场景、GPQA和WMDP生物学子集等其他挑战性基准测试中已达到或超越专家水平。与预期相反，思维链推理相较零样本评估并未显著提升性能，而o3-mini和Claude 3.7 Sonnet的扩展推理功能则如推断缩放预测般普遍提升了表现。PubMedQA及MMLU、WMDP生物学子集等基准测试表现出远低于100%的性能平台期，暗示基准数据存在饱和现象及底层错误。该分析强调随着AI系统持续发展，需要建立更完善的评估方法论。

Turbo-ICL: In-Context Learning-Based Turbo Equalization

Abstract

arXiv:2505.06175v1 Announce Type: cross Abstract: This paper introduces a novel in-context learning (ICL) framework, inspired by large language models (LLMs), for soft-input soft-output channel equalization in coded multiple-input multiple-output (MIMO) systems. The proposed approach learns to infer posterior symbol distributions directly from a prompt of pilot signals and decoder feedback. A key innovation is the use of prompt augmentation to incorporate extrinsic information from the decoder output as additional context, enabling the ICL model to refine its symbol estimates iteratively across turbo decoding iterations. Two model variants, based on Transformer and state-space architectures, are developed and evaluated. Extensive simulations demonstrate that, when traditional linear assumptions break down, e.g., in the presence of low-resolution quantization, ICL equalizers consistently outperform conventional model-based baselines, even when the latter are provided with perfect channel state information. Results also highlight the advantage of Transformer-based models under limited training diversity, as well as the efficiency of state-space models in resource-constrained scenarios.

摘要

本文提出了一种受大语言模型（LLM）启发的新型上下文学习（ICL）框架，用于编码多输入多输出（MIMO）系统中的软输入软输出信道均衡。该方法通过导频信号与解码器反馈构成的提示信息，直接学习推断后验符号分布。其核心创新在于采用提示增强技术，将解码器输出的外部信息作为额外上下文，使ICL模型能在Turbo解码迭代过程中逐步优化符号估计。研究开发并评估了基于Transformer和状态空间架构的两种模型变体。大量仿真表明，当传统线性假设失效时（如存在低分辨率量化），ICL均衡器始终优于基于模型的传统基线方法，即使后者具备完美信道状态信息。实验结果同时揭示了Transformer模型在训练多样性受限时的优势，以及状态空间模型在资源受限场景下的高效性。

A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

Abstract

arXiv:2505.06150v1 Announce Type: cross Abstract: We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length, which we term "dataset volume", play a decisive role in model performance. Our formulation is tuned following established procedures. Experiments on the BRICC dataset [salavati2024reducing] and subsets of the MMLU dataset [hendrycks2021measuringmassivemultitasklanguage], evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings.

摘要

我们提出了一种在固定计算预算下微调大语言模型（LLM）的缩放定律，该定律明确考虑了数据构成因素。传统方法仅通过总token数量来衡量训练数据，而样本数量及其平均token长度（我们称之为“数据集容量”）对模型性能具有决定性影响。该公式的调整遵循既定流程。在BRICC数据集[salavati2024reducing]和MMLU数据集[hendrycks2021measuringmassivemultitasklanguage]子集上进行的实验表明，采用多种子采样策略评估时，数据构成会显著影响token效率。这些结果为资源受限环境下实际LLM微调的精细化缩放定律提供了理论依据。
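The abstract does not give the scaling-law formula itself, so the following is only a small illustration of the motivating observation: two datasets can have the same total token count while differing sharply in number of examples and average length. The helper name `composition` and the toy numbers are assumptions made for this sketch.

```python
def composition(token_counts):
    """Return (num_examples, average_length, total_tokens) for a dataset.

    token_counts -- list with the token length of each training example
    """
    n = len(token_counts)
    total = sum(token_counts)
    return n, total / n, total


# Two toy datasets with identical total tokens but different composition.
many_short = [100] * 1000   # 1,000 examples of 100 tokens each
few_long = [1000] * 100     # 100 examples of 1,000 tokens each

print(composition(many_short))
print(composition(few_long))
```

A metric based only on the third element of each tuple cannot distinguish the two datasets, which is the gap the abstract says the proposed law is meant to address.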

AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence

Abstract

arXiv:2504.04430v3 Announce Type: replace Abstract: Despite remarkable progress in machine learning, current AI systems continue to fall short of true human-like intelligence. While Large Language Models (LLMs) excel in pattern recognition and response generation, they lack genuine understanding - an essential hallmark of Artificial General Intelligence (AGI). Existing AGI evaluation methods fail to offer a practical, gradual, and informative metric. This paper introduces the Artificial General Intelligence Test Bed (AGITB), comprising twelve rigorous tests that form a signal-processing-level foundation for the potential emergence of cognitive capabilities. AGITB evaluates intelligence through a model's ability to predict binary signals across time without relying on symbolic representations or pretraining. Unlike high-level tests grounded in language or perception, AGITB focuses on core computational invariants reflective of biological intelligence, such as determinism, sensitivity, and generalisation. The test bed assumes no prior bias, operates independently of semantic meaning, and ensures unsolvability through brute force or memorization. While humans pass AGITB by design, no current AI system has met its criteria, making AGITB a compelling benchmark for guiding and recognizing progress toward AGI.

摘要

尽管机器学习取得了显著进展，当前人工智能系统仍未能实现真正类人的智能。虽然大语言模型（LLMs）在模式识别和应答生成方面表现优异，但它们缺乏真正的理解能力——这正是通用人工智能（AGI）的核心特征。现有AGI评估方法无法提供实用、渐进且信息丰富的度量标准。本文提出通用人工智能测试平台（AGITB），该平台包含十二项严格测试，为认知能力可能涌现构建了信号处理层面的基础框架。AGITB通过模型在不依赖符号表征或预训练的情况下跨时间预测二进制信号的能力来评估智能水平。与基于语言或感知的高层次测试不同，AGITB聚焦反映生物智能核心的计算不变性特征，如确定性、敏感性和泛化能力。该测试平台不预设先验偏置，独立于语义含义运作，并确保无法通过暴力破解或记忆方式求解。虽然人类设计上能通过AGITB测试，但目前尚无AI系统达到其标准，这使得AGITB成为引导和识别AGI发展进程的理想基准。

The Typing Cure: Experiences with Large Language Model Chatbots for Mental Health Support

Abstract

arXiv:2401.14362v3 Announce Type: replace-cross Abstract: People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived experiences of people who have used LLM chatbots for mental health support. We build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. We ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning AI with therapeutic values for mental health contexts. Our study offers recommendations for how designers can approach the ethical and effective use of LLM chatbots and other AI mental health support tools in mental health care.

摘要

经历严重心理困扰的人群日益将大型语言模型（LLM）聊天机器人作为心理健康支持工具。社交媒体讨论显示，这种互动对部分使用者具有挽救生命的作用，但证据表明通用型LLM聊天机器人若未经过负责任设计，也可能存在显著风险并危及用户福祉。本研究探讨了使用LLM聊天机器人获取心理健康支持者的真实体验。基于对21位全球多元背景个体的访谈，我们分析了用户如何为聊天机器人创造独特的支持角色、填补日常护理缺口，以及在寻求聊天机器人支持时应对相关文化限制。研究以心理治疗领域关于有效支持的理论为基础，提出'治疗性对齐'概念——即在心理健康场景中将人工智能与治疗价值观对齐。本研究为设计者如何伦理且有效地运用LLM聊天机器人及其他人工智能心理健康支持工具提供了实践建议。

AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning

Abstract

arXiv:2505.03332v2 Announce Type: replace Abstract: Critical peer review of scientific manuscripts presents a significant challenge for Large Language Models (LLMs), partly due to data limitations and the complexity of expert reasoning. This report introduces Persistent Workflow Prompting (PWP), a potentially broadly applicable prompt engineering methodology designed to bridge this gap using standard LLM chat interfaces (zero-code, no APIs). We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts, featuring a hierarchical, modular architecture (structured via Markdown) that defines detailed analysis workflows. We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review workflows, including tacit knowledge. Submitted once at the start of a session, this PWP prompt equips the LLM with persistent workflows triggered by subsequent queries, guiding modern reasoning LLMs through systematic, multimodal evaluations. Demonstrations show the PWP-guided LLM identifying major methodological flaws in a test case while mitigating LLM input bias and performing complex tasks, including distinguishing claims from evidence, integrating text/photo/figure analysis to infer parameters, executing quantitative feasibility checks, comparing estimates against claims, and assessing a priori plausibility. To ensure transparency and facilitate replication, we provide full prompts, detailed demonstration analyses, and logs of interactive chats as supplementary resources. Beyond the specific application, this work offers insights into the meta-development process itself, highlighting the potential of PWP, informed by detailed workflow formalization, to enable sophisticated analysis using readily available LLMs for complex scientific tasks.

摘要

科学手稿的批判性同行评审对大型语言模型（LLMs）构成重大挑战，部分源于数据限制和专家推理的复杂性。本报告提出持续工作流提示法（PWP），这是一种可能具有广泛适用性的提示工程方法，旨在通过标准LLM聊天界面（零代码、无需API）来弥合这一差距。我们展示了一个用于实验化学手稿批判性分析的概念验证PWP提示，其采用分层模块化架构（通过Markdown结构化），定义了详细的分析工作流程。该PWP提示通过迭代应用元提示技术和元推理开发而成，旨在系统化编码专家评审工作流程（包括隐性知识）。在会话开始时提交一次该PWP提示，即可为LLM配备由后续查询触发的持续工作流，引导现代推理型LLM进行系统性多模态评估。演示表明，PWP引导的LLM在测试案例中能识别主要方法缺陷，同时缓解LLM输入偏差，并执行复杂任务，包括区分主张与证据、整合文本/照片/图表分析以推断参数、执行定量可行性检查、将估算值与主张对比以及评估先验合理性。为确保透明度并便于复现，我们提供了完整提示、详细演示分析记录和交互式聊天日志作为补充资源。除具体应用外，本研究还揭示了元开发过程本身，凸显了PWP通过精细化工作流形式化，利用现成LLM实现复杂科学任务高级分析的潜力。

CoverUp: Effective High Coverage Test Generation for Python

Abstract

arXiv:2403.16218v4 Announce Type: replace-cross Abstract: Testing is an essential part of software development. Test generation tools attempt to automate the otherwise labor-intensive task of test creation, but generating high-coverage tests remains challenging. This paper proposes CoverUp, a novel approach to driving the generation of high-coverage Python regression tests. CoverUp combines coverage analysis, code context, and feedback in prompts that iteratively guide the LLM to generate tests that improve line and branch coverage. We evaluate our prototype CoverUp implementation across a benchmark of challenging code derived from open-source Python projects and show that CoverUp substantially improves on the state of the art. Compared to CodaMosa, a hybrid search/LLM-based test generator, CoverUp achieves a per-module median line+branch coverage of 80% (vs. 47%). Compared to MuTAP, a mutation- and LLM-based test generator, CoverUp achieves an overall line+branch coverage of 89% (vs. 77%). We also demonstrate that CoverUp's performance stems not only from the LLM used but from the combined effectiveness of its components.

摘要

测试是软件开发中不可或缺的环节。测试生成工具旨在自动化这一原本费时费力的测试创建任务，但生成高覆盖率的测试仍具挑战性。本文提出CoverUp这一创新方法，用于驱动生成高覆盖率的Python回归测试。CoverUp通过结合覆盖率分析、代码上下文及提示反馈，迭代式引导大型语言模型生成能提升行覆盖率和分支覆盖率的测试用例。我们在源自开源Python项目的复杂代码基准集上评估CoverUp原型实现，结果表明该方法显著优于现有技术：相较于混合搜索/LLM的测试生成器CodaMosa，CoverUp在模块级别达到80%的行+分支覆盖率中位数（对比47%）；相比基于突变和LLM的测试生成器MuTAP，CoverUp总体行+分支覆盖率达89%（对比77%）。实验还证明CoverUp的优异性能不仅源于所采用的大型语言模型，更得益于其各组件协同作用带来的整体效能提升。

An Invitation to Deep Reinforcement Learning

Abstract

arXiv:2312.08365v3 Announce Type: replace-cross Abstract: Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning, if the target objective is differentiable. For many interesting problems, this is however not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective. Reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives in recent years. Examples include aligning large language models via human feedback, code generation, object detection or control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time intensive to approach due to the large range of methods, as well as the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.

摘要

在过去十年中，通过训练深度神经网络以最大化目标指标已成为机器学习领域成功的标准方法。当目标函数可微时，这些网络可通过监督学习进行优化。然而，对于许多具有挑战性的问题（如交并比IoU、双语评估替补BLEU分数或奖励函数等指标），监督学习无法直接优化。常见的解决方案是设计可微的替代损失函数，但这往往会导致结果偏离实际目标的最优解。近年来，强化学习（RL）逐渐成为优化深度神经网络以最大化不可微目标的重要替代方案，其应用涵盖基于人类反馈的大语言模型对齐、代码生成、目标检测及控制问题等领域，这使得强化学习技术对更广泛的机器学习研究者具有重要价值。但由于方法体系庞大且理论阐述通常较为抽象，该领域的学习门槛较高。本导论采用不同于经典强化学习教材的路径：首先将强化学习作为监督学习的扩展框架引入，先应用于不可微目标优化，再延伸至时序问题。读者只需具备监督学习基础知识，即可通过本教程理解近端策略优化（PPO）等最先进的深度强化学习算法。
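To make the "RL as a generalization of supervised learning for non-differentiable objectives" idea concrete, here is a minimal sketch for a one-step problem with a softmax policy and a fixed reward table. It is not taken from the tutorial. To keep the example deterministic it uses the exact expected gradient, which is the expectation of the score-function (REINFORCE) estimator, instead of sampled actions. All function names and the toy rewards are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward J = sum_a pi(a) * r(a).

    For a softmax policy, dJ/dlogit_k = pi(k) * (r(k) - J). The reward table
    is only looked up, never differentiated.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [
        x + lr * p * (r - expected)
        for x, p, r in zip(logits, probs, rewards)
    ]


def train(rewards, steps=500, lr=1.0):
    """Optimize a one-step policy against a reward table; returns action probabilities."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
    return softmax(logits)


probs = train([0.0, 1.0, 0.2])  # action 1 has the highest reward
```

In practice the expectation is replaced by samples from the policy, and the reward may come from a metric such as IoU or BLEU; the update rule keeps the same shape.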

"Set It Up!": Functional Object Arrangement with Compositional Generative Models

Abstract

arXiv:2405.11928v3 Announce Type: replace-cross Abstract: This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.

摘要

本文研究了开发能够理解模糊指令以创建功能性物体排列的机器人所面临的挑战，例如"布置一张两人用餐的餐桌"；而现有排列方法主要针对更明确的指令，如"将物体A放在桌子上"。我们提出了SetItUp框架，用于学习解释模糊指令。该框架通过少量训练示例和人工编写的程序草图，来发现特定场景类型的排列规则。通过利用物体间抽象空间关系的中间图状表示，SetItUp将排列问题分解为两个子问题：i) 从有限数据中学习排列模式；ii) 将这些抽象关系落实到物体位姿中。SetItUp利用大型语言模型(LLMs)提出新场景中物体间的抽象空间关系作为待满足约束条件，然后组合与这些抽象关系相关联的扩散模型库，以找到满足约束条件的物体位姿。我们在包含学习桌、餐桌和咖啡桌的数据集上验证了该框架，结果表明相较于现有模型，SetItUp在生成物理合理、功能完善且美观的物体排列方面表现出更优的性能。

Learning Algorithms Made Simple

Abstract

arXiv:2410.09186v2 Announce Type: replace-cross Abstract: In this paper, we discuss learning algorithms and their importance in different types of applications, including training to identify important patterns and features, in a straightforward, easy-to-understand manner. We review the main concepts of artificial intelligence (AI), machine learning (ML), deep learning (DL), and hybrid models. Some important subsets of machine learning algorithms, such as supervised, unsupervised, and reinforcement learning, are also discussed. These techniques can be used for important tasks like prediction, classification, and segmentation. Convolutional Neural Networks (CNNs) are used for image and video processing and many other applications. We dive into the architecture of CNNs and how to integrate CNNs with ML algorithms to build hybrid models. This paper explores the vulnerability of learning algorithms to noise, which leads to misclassification. We further discuss the integration of learning algorithms with Large Language Models (LLMs) to generate coherent responses applicable to many domains such as healthcare, marketing, and finance by learning important patterns from large volumes of data. Furthermore, we discuss the next generation of learning algorithms and how a unified Adaptive and Dynamic Network might perform important tasks. Overall, this article provides a brief overview of learning algorithms, exploring their current state, applications, and future directions.

Recent Advances in Federated Learning Driven Large Language Models: A Survey on Architecture, Performance, and Security

Abstract

arXiv:2406.09831v2 Announce Type: replace-cross Abstract: Federated Learning (FL) offers a promising paradigm for training Large Language Models (LLMs) in a decentralized manner while preserving data privacy and minimizing communication overhead. This survey examines recent advancements in FL-driven LLMs, with a particular emphasis on architectural designs, performance optimization, and security concerns, including the emerging area of machine unlearning. In this context, machine unlearning refers to the systematic removal of specific data contributions from trained models to comply with privacy regulations such as the Right to be Forgotten. We review a range of strategies enabling unlearning in federated LLMs, including perturbation-based methods, model decomposition, and incremental retraining, while evaluating their trade-offs in terms of efficiency, privacy guarantees, and model utility. Through selected case studies and empirical evaluations, we analyze how these methods perform in practical FL scenarios. This survey identifies critical research directions toward developing secure, adaptable, and high-performing federated LLM systems for real-world deployment.

摘要

联邦学习（FL）为大规模语言模型（LLMs）的去中心化训练提供了一种前景广阔的范式，既能保护数据隐私，又能降低通信开销。本综述探讨了联邦学习驱动下LLMs的最新进展，重点分析了架构设计、性能优化以及安全问题（包括新兴的机器遗忘领域）。在此背景下，机器遗忘指为遵守《被遗忘权》等隐私法规，从已训练模型中系统移除特定数据贡献的技术。我们综述了实现联邦LLMs遗忘的一系列策略，包括基于扰动的方法、模型分解和增量式再训练，并评估了这些方法在效率、隐私保障和模型效用方面的权衡。通过精选案例研究和实证评估，我们分析了这些方法在实际联邦学习场景中的表现。本综述为开发安全、适应性强且高性能的联邦LLM系统指明了关键研究方向，以推动其在实际场景中的部署。

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

Abstract

arXiv:2406.09519v4 Announce Type: replace-cross Abstract: Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task, and find that it underlies a commonly used abstraction across many context-retrieval behaviors. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers, forming low-rank communication channels (Elhage et al., 2021) between layers. A particular 3D subspace in model activations in GPT-2 can be traversed to positionally index items in lists, and we show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. That is, the model has trouble copying the correct information from context when many items "crowd" this limited space. By decomposing attention heads with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices alone. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.

摘要

尽管已知Transformer语言模型（LMs）会将早期层的特征传递至后续层，但关于这些信息如何被模型表征和路由的机制尚未得到充分理解。我们分析了两种LMs在特定任务中选择性抑制上下文项的机制，发现该机制构成了多种上下文检索行为中常用抽象的基础。具体而言，我们发现模型通过将特征写入残差流的低秩子空间进行表征，随后由后续层读取，从而形成层间低秩通信通道（Elhage等，2021）。在GPT-2的模型激活中存在一个特定的三维子空间，通过遍历该空间可实现列表中项的位置索引；我们证明该机制可解释模型对提示项顺序表现出的看似任意敏感性——当多个项"挤占"这一有限空间时，模型难以从上下文中正确复制信息。通过奇异值分解（SVD）对注意力头进行分解，我们发现仅通过分析权重矩阵即可预测先前描述的跨层头间交互作用。基于所发现的机制，我们展示了通过操纵内部模型表征及编辑模型权重，可显著提升在需要列表回忆的合成任务Laundry List上的性能，通常能使任务准确率提高20%以上。该分析揭示了语言模型预训练中学习到的惊人复杂的可解释结构，有助于理解为何复杂语言模型在简单领域会出现失效，为未来分析更复杂行为提供了基础。

Detecting Multimedia Generated by Large AI Models: A Survey

Abstract

arXiv:2402.00045v4 Announce Type: replace-cross Abstract: The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large language models, has marked a new era where AI-generated multimedia is increasingly integrated into various aspects of daily life. Although beneficial in numerous fields, this content presents significant risks, including potential misuse, societal disruptions, and ethical concerns. Consequently, detecting multimedia generated by LAIMs has become crucial, with a marked rise in related research. Despite this, there remains a notable gap in systematic surveys that focus specifically on detecting LAIM-generated multimedia. Addressing this, we provide the first survey to comprehensively cover existing research on detecting multimedia (such as text, images, videos, audio, and multimodal content) created by LAIMs. Specifically, we introduce a novel taxonomy for detection methods, categorized by media modality, and aligned with two perspectives: pure detection (aiming to enhance detection performance) and beyond detection (adding attributes like generalizability, robustness, and interpretability to detectors). Additionally, we have presented a brief overview of generation mechanisms, public datasets, online detection tools, and evaluation metrics to provide a valuable resource for researchers and practitioners in this field. Most importantly, we offer a focused analysis from a social media perspective to highlight their broader societal impact. Furthermore, we identify current challenges in detection and propose directions for future research that address unexplored, ongoing, and emerging issues in detecting multimedia generated by LAIMs. Our aim for this survey is to fill an academic gap and contribute to global AI security efforts, helping to ensure the integrity of information in the digital realm. The project link is https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.

摘要

大型人工智能模型（LAIMs），特别是扩散模型和大语言模型的快速发展，标志着人工智能生成的多媒体内容日益融入日常生活各个领域的新时代。尽管这类内容在诸多领域具有积极意义，但其潜在滥用风险、社会扰乱效应及伦理问题也不容忽视。因此，LAIM生成多媒体的检测技术变得至关重要，相关研究呈现显著增长。然而，目前仍缺乏专门针对LAIM生成多媒体检测的系统性综述研究。为此，我们首次对LAIM生成的文本、图像、视频、音频及多模态内容检测研究进行全面综述，提出按媒体模态分类的新颖方法学分类体系，并从纯检测（旨在提升检测性能）和超越检测（为检测器增加泛化性、鲁棒性和可解释性等属性）两个维度进行梳理。此外，我们还简要概述了生成机制、公开数据集、在线检测工具和评估指标，为该领域研究者与实践者提供宝贵资源。最重要的是，我们从社交媒体视角进行聚焦分析，以揭示其更广泛的社会影响。进一步地，我们指出现有检测技术面临的挑战，并就LAIM生成多媒体检测中尚未探索、持续发展和新出现的问题提出未来研究方向。本综述旨在填补学术空白，为全球人工智能安全事业做出贡献，助力维护数字领域的信息完整性。项目链接详见https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey。

Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

Abstract

arXiv:2409.17264v3 Announce Type: replace-cross Abstract: As large language models (LLMs) handle increasingly longer contexts, serving long inference requests of millions of tokens presents unique challenges. We show that existing work for long context inference is largely based on techniques from long context training, and does not handle the high variability in input lengths during inference. This leads to inefficient resource utilization, server fragmentation, and head-of-line (HOL) blocking. We present Medha, an end-to-end system for efficient long-context LLM inference that addresses these challenges through fine-grained time sharing. Medha introduces three key innovations: (1) the mechanism of adaptive prefill chunking to help mitigate HOL blocking with preemption; (2) two new parallelism strategies: Sequence Pipeline Parallelism (SPP) to reduce time-to-first-token by pipelining prefill chunks, and KV-Cache Parallelism (KVP) to lower time-per-output-token by distributing decoding across servers; and (3) a novel input-length aware least remaining slack scheduling to meet Service Level Objectives (SLOs). Medha enables exact inference scaling beyond 10 million tokens, maintaining high throughput and low latency across mixed-length workloads. Compared to state-of-the-art systems, Medha reduces server fragmentation, cuts median latency by up to 30x, and improves throughput by over 5x, delivering production-scale long-context inference without compromising performance on shorter requests.

摘要

随着大语言模型（LLM）处理上下文长度的不断增加，服务数百万token的长推理请求面临独特挑战。我们发现现有长上下文推理工作主要基于长上下文训练技术，未能有效应对推理时输入长度的高度可变性，从而导致资源利用率低下、服务器碎片化以及队头阻塞（HOL）问题。

本文提出Medha——一个面向高效长上下文LLM推理的端到端系统，通过细粒度分时技术解决上述挑战。Medha包含三项关键创新：（1）自适应预填充分块机制，通过抢占式调度缓解HOL阻塞；（2）两种新型并行策略：序列流水线并行（SPP）通过预填充分块流水线化降低首token延迟，KV缓存并行（KVP）通过跨服务器分布式解码减少单token生成时间；（3）基于输入长度的最小剩余松弛调度算法，确保服务等级目标（SLO）达成。

Medha实现了精确推理规模超过1000万token，在混合长度工作负载下保持高吞吐与低延迟。相比最先进系统，Medha减少服务器碎片化，将中位延迟降低达30倍，吞吐量提升超过5倍，在保持短请求性能的同时实现生产级长上下文推理。
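The "input-length aware least remaining slack" policy is only named in the abstract, so the sketch below shows one plausible reading: estimate each request's remaining work from its remaining input length, subtract that from the time left before its deadline, and run the request with the smallest slack. The `Request` fields, the linear cost model (`tokens_per_sec`), and the function names are assumptions for illustration, not Medha's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """An inference request (all fields are illustrative assumptions)."""
    name: str
    deadline: float          # absolute SLO deadline, in seconds
    remaining_tokens: int    # prefill tokens still to process


def remaining_slack(req: Request, now: float, tokens_per_sec: float) -> float:
    """Time left before the deadline minus the estimated remaining work."""
    estimated_work = req.remaining_tokens / tokens_per_sec
    return (req.deadline - now) - estimated_work


def pick_next(queue: list[Request], now: float, tokens_per_sec: float) -> Request:
    """Least remaining slack: run the request closest to missing its SLO."""
    return min(queue, key=lambda r: remaining_slack(r, now, tokens_per_sec))


queue = [
    Request("short", deadline=2.0, remaining_tokens=1_000),
    Request("long", deadline=10.0, remaining_tokens=90_000),
]
chosen = pick_next(queue, now=0.0, tokens_per_sec=10_000.0)
```

Here the long request is chosen even though its deadline is later, because its estimated work leaves it less slack. A real system would likely need a cost model that grows faster than linearly with context length.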

Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

Abstract

arXiv:2410.18234v2 Announce Type: replace-cross Abstract: We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.

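Summary

We study multi-draft speculative sampling, in which the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Prior work has shown that the optimal scheme, the one that maximizes the probability of accepting one of the input tokens, can be cast as the solution to a linear program. This work shows that the optimal scheme decomposes into a two-step solution: the first step uses an importance sampling (IS) type scheme to select an intermediate token, and the second step applies (single-draft) speculative sampling to generate the output token. For the case of two identical draft models, we further (1) establish a necessary and sufficient condition on the target and draft distributions for the acceptance probability to equal one and (2) give an explicit expression for the optimal acceptance probability. The theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Experiments show consistent improvements over baseline schemes in achievable block efficiency and token rate across a number of scenarios.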
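The second step of the decomposition is ordinary single-draft speculative sampling. The sketch below shows that standard rule only, not the paper's multi-draft scheme or its importance-sampling first step; the function names and the toy distributions are illustrative assumptions. A draft token x drawn from q is accepted with probability min(1, p(x)/q(x)); on rejection, a token is drawn from the normalized residual max(p - q, 0), so the output follows the target distribution p.

```python
import random


def speculative_step(p, q, rng):
    """One single-draft speculative sampling step.

    p: target distribution, q: draft distribution (lists of probabilities).
    Returns (token, accepted).
    """
    # Draw the draft token from the draft distribution q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Otherwise resample from the residual max(p - q, 0); rng.choices
    # normalizes the weights, and the residual mass is positive here
    # because a rejection implies p[x] < q[x].
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    y = rng.choices(range(len(p)), weights=residual)[0]
    return y, False


def acceptance_probability(p, q):
    """Probability that the draft token is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


rng = random.Random(0)
target, draft = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
print(speculative_step(target, draft, rng), acceptance_probability(target, draft))
```

In this single-draft setting the acceptance probability equals one exactly when the draft and target distributions coincide; the abstract's condition for two identical draft models concerns the more general multi-draft case.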

SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation

Abstract

arXiv:2411.11053v5 Announce Type: replace-cross Abstract: Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems. Our code and data are public at https://github.com/DIRECT-BIT/SRA-MCTS.

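Summary

Large language models perform very well on simple code generation tasks but still struggle with complex problems. These difficulties may stem from insufficient reasoning and problem decomposition capabilities. To address this, we propose SRA-MCTS, a reasoning-augmented data generation method that guides the model to autonomously generate high-quality intermediate reasoning paths, forming a positive feedback loop that enables continuous improvement. The method runs entirely through the model itself and requires no additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, it ensures analytical accuracy and raises the success rate on complex tasks. Experiments show that, even without additional supervisory signals, the method improves performance across models of different scales, demonstrating the significant potential of self-improvement in small models. The method also remains robust when traditional Chain-of-Thought (CoT) approaches degrade, with notable gains on diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to strengthen the ability of language models to solve complex problems. Code and data are public at https://github.com/DIRECT-BIT/SRA-MCTS.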
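For readers unfamiliar with pass@k, the following sketch implements the widely used unbiased estimator from the code-generation evaluation literature: given n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not say which estimator the authors use, so treat this as an assumption about a common convention rather than their exact evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples with c correct, passes."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 20 samples per problem, 5 of them correct.
print(pass_at_k(20, 5, 1), pass_at_k(20, 5, 10))
```

A higher pass@10 at a similar pass@1 indicates that the correct solutions are spread over more diverse samples, which is why the metric is often read as a diversity signal.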

VladVA: Discriminative Fine-tuning of LVLMs

Abstract

arXiv:2412.04378v3 Announce Type: replace-cross Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

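Summary

Contrastively trained Vision-Language Models (VLMs) such as CLIP have become the de facto standard for discriminative vision-language representation learning. However, these models have limited language understanding and often exhibit "bag of words" behavior. Meanwhile, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, are capable of detailed vision-language reasoning, but their autoregressive nature makes them less suitable for discriminative tasks.

This work proposes a new approach that combines "the best of both worlds": discriminative fine-tuning of LVLMs that yields both strong discriminative and compositional capabilities. In essence, the approach converts a generative LVLM into a discriminative one, unlocking powerful image-text discrimination combined with enhanced language understanding.

The main contributions are: (1) a carefully designed training/optimization framework that uses image-text pairs of variable length and granularity to train the model jointly with contrastive and next-token prediction losses, with ablation studies validating the necessity of each component; (2) a parameter-efficient adaptation method that combines soft prompting with LoRA adapters; and (3) significant improvements over state-of-the-art CLIP-like models of similar size, including on standard image-text retrieval benchmarks, along with notable gains in compositionality.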

AdaCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Chain-of-Thought

Abstract

arXiv:2501.16154v2 Announce Type: replace-cross Abstract: Large language models have shown impressive multilingual capabilities through pretraining on diverse corpora. While these models show strong reasoning abilities, their performance varies significantly across languages due to imbalanced training data distribution. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaCoT (Adaptive Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary "thinking languages" before generating target-language responses. AdaCoT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high and low-resource languages while maintaining cultural and linguistic nuances.

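Summary

Large language models have shown impressive multilingual capabilities through pretraining on diverse corpora. Although these models reason well, their performance varies significantly across languages because the training data distribution is imbalanced. Existing approaches rely on sample-level translation for large-scale multilingual pretraining and cross-lingual tuning, but they face scalability challenges and often fail to capture nuanced reasoning processes across languages. This paper introduces AdaCoT (Adaptive Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes through intermediary "thinking languages" before generating the target-language response. AdaCoT uses a language-agnostic core and an adaptive, reward-based mechanism to select optimal reasoning pathways, with no additional pretraining required. A comprehensive evaluation on multiple benchmarks shows substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong gains in low-resource language settings. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high-resource and low-resource languages while preserving cultural and linguistic nuances.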

Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes

Abstract

arXiv:2501.12106v3 Announce Type: replace-cross Abstract: Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.

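Summary

Tumor documentation in Germany is largely done by hand: staff read patient records and enter data into structured databases. Large language models (LLMs) could improve this process by increasing efficiency and reliability. This study evaluates eleven open source LLMs, ranging from 1 to 70 billion parameters, on three basic tumor documentation tasks: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. To evaluate the models, a dataset of annotated text snippets was built from anonymized urology doctors' notes. Different prompting strategies were used to study how the number of examples affects few-shot prompting and to probe the general capabilities of the LLMs. Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well on the tasks. Models with less extensive training data or fewer than 7 billion parameters performed notably worse, while larger models showed no performance gains. Examples from a medical domain other than urology also improved few-shot results, which demonstrates that LLMs can handle the tasks needed for tumor documentation. Open source LLMs show strong potential for automating tumor documentation, and models with 7 to 12 billion parameters may offer the best balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation. The evaluation code is available at https://github.com/stefan-m-lenz/UroLlmEval. The study also releases the dataset, a valuable resource that addresses the shortage of authentic, easily accessible benchmarks in German-language medical NLP.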

JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

Abstract

arXiv:2501.14851v2 Announce Type: replace-cross Abstract: Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic.

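Summary

Logical reasoning is a key capability of Large Language Models (LLMs), and much recent research has aimed to improve their deductive reasoning. However, existing deductive reasoning benchmarks, which are important tools for evaluating and advancing LLMs, have clear shortcomings: insufficient task complexity, prior knowledge acting as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. The benchmark has three defining properties: (i) it is highly complex, able to generate a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) it is independent of prior knowledge, which removes the advantage of models that possess such knowledge and ensures that questions are answered through deductive reasoning alone; and (iii) it supports in-depth error analysis of the heterogeneous effects of reasoning depth and argument form on model accuracy. Experiments on JustLogic show that (i) state-of-the-art reasoning LLMs match or slightly exceed the human average but fall significantly below the human ceiling, and (ii) state-of-the-art non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic.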

Reimagining Urban Science: Scaling Causal Inference with Large Language Models

Abstract

arXiv:2504.12345v2 Announce Type: replace-cross Abstract: Urban causal research is essential for understanding the complex dynamics of cities and informing evidence-based policies. However, it is challenged by the inefficiency and bias of hypothesis generation, barriers to multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) present an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies that categorize research topics, data sources, and methodological approaches to identify structural gaps. We then introduce an LLM-driven conceptual framework, AutoUrbanCI, composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and unlock more inclusive forms of urban causal reasoning.

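Summary

Urban causal research is essential for understanding the complex dynamics of cities and for informing evidence-based policy. However, the field is held back by inefficient and biased hypothesis generation, barriers posed by multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) offer an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies of research topics, data sources, and methodological approaches in order to identify structural gaps. It then introduces AutoUrbanCI, an LLM-driven framework composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on the implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that treats AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and enable more inclusive forms of urban causal reasoning.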

MERGE$^3$: Efficient Evolutionary Merging on Consumer-grade GPUs

Abstract

arXiv:2502.10436v4 Announce Type: replace-cross Abstract: Evolutionary model merging enables the creation of high-performing multi-task models but remains computationally prohibitive for consumer hardware. We introduce MERGE$^3$, an efficient framework that makes evolutionary merging feasible on a single GPU by reducing fitness computation costs 50$\times$ while preserving performance. MERGE$^3$ achieves this by Extracting a reduced dataset for evaluation, Estimating model abilities using Item Response Theory (IRT), and Evolving optimal merges via IRT-based performance estimators. Our method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages with significantly lower computational overhead. We provide theoretical guarantees and an open-source library, democratizing high-quality model merging.

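Summary

Evolutionary model merging can produce high-performing multi-task models, but its computational cost remains prohibitive for consumer hardware. We introduce MERGE$^3$, a framework that makes evolutionary merging feasible on a single GPU by cutting fitness computation costs by 50$\times$ while preserving performance. The framework does this in three steps: Extracting a reduced dataset for evaluation, Estimating model abilities with Item Response Theory (IRT), and Evolving optimal merges using IRT-based performance estimators. The method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages at significantly lower computational cost. The work provides theoretical guarantees and an open-source library, making high-quality model merging more widely accessible.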
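The "Estimate" step can be illustrated with a two-parameter logistic (2PL) IRT model, in which the probability of answering item i correctly is sigmoid(a_i (theta - b_i)). The sketch below is an illustration under stated assumptions, not the MERGE$^3$ library: the 2PL form, the grid-search fit, the function names, and the item parameters are all hypothetical. It fits a model's ability theta from responses on a small item subset and then predicts accuracy over a larger item pool.

```python
import math


def p_correct(theta, a, b):
    """2PL IRT: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate via a simple grid search.

    responses: list of 0/1 outcomes on the reduced item set.
    items: list of (a, b) parameters for the same items.
    """
    grid = grid or [i / 100 for i in range(-400, 401)]

    def log_likelihood(theta):
        total = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if r else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)


def predicted_accuracy(theta, items):
    """Expected accuracy over an item pool for a model of ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)


subset = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]          # hypothetical item parameters
full_pool = subset + [(1.2, 0.5), (0.8, -0.5), (1.5, 2.0)]
theta_hat = estimate_ability([1, 1, 0], subset)
print(theta_hat, predicted_accuracy(theta_hat, full_pool))
```

The appeal of this kind of estimator for evolutionary search is that each candidate merge only needs to be run on the reduced subset; the predicted accuracy then serves as the fitness value.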

Privacy-Preserved Automated Scoring using Federated Learning for Educational Research

Abstract

arXiv:2503.11711v2 Announce Type: replace-cross Abstract: Data privacy remains a critical concern in educational research, requiring strict adherence to ethical standards and regulatory protocols. While traditional approaches rely on anonymization and centralized data collection, they often expose raw student data to security vulnerabilities and impose substantial logistical overhead. In this study, we propose a federated learning (FL) framework for automated scoring of educational assessments that eliminates the need to share sensitive data across institutions. Our approach leverages parameter-efficient fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA), enabling each client (school) to train locally while sharing only optimized model updates. To address data heterogeneity, we implement an adaptive weighted aggregation strategy that considers both client performance and data volume. We benchmark our model against two state-of-the-art FL methods and a centralized learning baseline using NGSS-aligned multi-label science assessment data from nine middle schools. Results show that our model achieves the highest accuracy (94.5%) among FL approaches, and performs within 0.5-1.0 percentage points of the centralized model on these metrics. Additionally, it achieves comparable rubric-level scoring accuracy, with only a 1.3% difference in rubric match and a lower score deviation (MAE), highlighting its effectiveness in preserving both prediction quality and interpretability.

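Summary

Data privacy remains a critical concern in educational research and demands strict adherence to ethical standards and regulatory protocols. Traditional approaches rely on anonymization and centralized data collection, but they often expose raw student data to security risks and impose a heavy logistical burden. This study proposes a federated learning (FL) framework for automated scoring of educational assessments that removes the need to share sensitive data across institutions. The approach applies parameter-efficient fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA), so that each client (school) trains locally and shares only optimized model updates. To handle data heterogeneity, we implement an adaptive weighted aggregation strategy that accounts for both client performance and data volume. Using NGSS-aligned multi-label science assessment data from nine middle schools, we benchmark the model against two state-of-the-art FL methods and a centralized learning baseline. The model achieves the highest accuracy (94.5%) among the FL approaches and comes within 0.5-1.0 percentage points of the centralized model on these metrics. It also reaches comparable rubric-level scoring accuracy, with only a 1.3% difference in rubric match and a lower score deviation (MAE), which shows that it preserves both prediction quality and interpretability.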
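One way to picture an aggregation rule that considers both client performance and data volume is sketched below. The abstract does not give the paper's formula, so the product weighting (sample count times validation score), the function name, and the flat-list representation of LoRA updates are assumptions made purely for illustration.

```python
def aggregate(updates, num_samples, scores):
    """Weighted average of client updates.

    updates: one flat list of parameter deltas per client.
    num_samples: training examples held by each client.
    scores: a per-client quality signal in (0, 1], e.g. validation accuracy.
    The weight n_i * s_i is a hypothetical choice, not the paper's rule.
    """
    raw = [n * s for n, s in zip(num_samples, scores)]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(updates[0])
    merged = [sum(w * u[j] for w, u in zip(weights, updates)) for j in range(dim)]
    return merged, weights


# Three hypothetical schools with different data volumes and accuracies.
merged, weights = aggregate(
    updates=[[0.10, -0.20], [0.30, 0.00], [-0.10, 0.40]],
    num_samples=[120, 400, 80],
    scores=[0.91, 0.88, 0.95],
)
print(weights, merged)
```

With equal scores this reduces to the familiar sample-count weighting of federated averaging; the score term lets a small but well-performing client count for more than its data volume alone would allow.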

Bielik 11B v2 Technical Report

Abstract

arXiv:2505.02410v2 Announce Type: replace-cross Abstract: We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.
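The idea behind a quality-weighted instruction loss can be sketched as follows. This is an illustrative reading of the description above, not the report's implementation: the normalization by the sum of weights, the per-example token averaging, and all names are assumptions. Each training example contributes its mean token-level cross-entropy, scaled by a quality weight.

```python
import math


def weighted_instruction_ce(correct_token_probs, quality_weights):
    """Quality-weighted cross-entropy over a batch of examples.

    correct_token_probs: for each example, the probabilities the model
        assigned to the correct next tokens.
    quality_weights: one positive weight per example (hypothetical quality score).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for probs, w in zip(correct_token_probs, quality_weights):
        # Mean negative log-likelihood of the correct tokens in this example.
        example_ce = -sum(math.log(p) for p in probs) / len(probs)
        weighted_sum += w * example_ce
        weight_total += w
    return weighted_sum / weight_total


batch = [[0.9, 0.8, 0.95], [0.4, 0.5]]  # toy per-token probabilities
print(weighted_instruction_ce(batch, [1.0, 0.3]))
```

Down-weighting the second example means its higher loss pulls the batch loss, and hence the gradient, less strongly than it would under a plain average.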

Bielik v3 Small: Technical Report

Abstract

arXiv:2505.02550v2 Announce Type: replace-cross Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.

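Summary

We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B parameters) optimized for Polish language processing. The models show that smaller, well-optimized architectures can match the performance of much larger counterparts while requiring substantially fewer computational resources. The approach combines several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and an Adaptive Learning Rate that adjusts dynamically with training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, the models excel on multiple benchmarks, including the Open PL LLM Leaderboard, the Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and the Polish Medical Leaderboard. The 4.5B model is competitive with models 2-3 times its size, and the 1.5B model delivers strong performance despite its extremely compact profile. These advances set new benchmarks for parameter-efficient language modeling in less-represented languages and make high-quality Polish language AI more accessible for resource-constrained applications.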

Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Abstract

arXiv:2502.20364v2 Announce Type: replace-cross Abstract: Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain here comprises complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends, challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, for scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.

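Summary

Agentic generative AI, built on Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), is a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. The technology excels at inferring relationships within vast unstructured or semi-structured datasets. Legal data is extensive, interrelated, and organized in semi-structured knowledge systems with complex relations; it covers constitutions, statutes, regulations, and case law. Extracting insights from legal documents and navigating the networks of relations among them is crucial for effective legal research. This paper presents a generative AI system that integrates RAG, VS, and KG, constructed via Non-Negative Matrix Factorization (NMF), to strengthen legal information retrieval and AI reasoning while reducing hallucinations. In the legal system, these technologies enable AI agents to identify and analyze complex connections among cases, statutes, and precedents, uncovering hidden relationships and predicting legal trends, challenging tasks that are essential for ensuring justice and improving operational efficiency. The system uses web scraping to systematically collect legal texts (statutes, constitutional provisions, and case law) from publicly accessible platforms such as Justia, and it bridges the gap between traditional keyword search and contextual understanding through advanced semantic representations, hierarchical relationships, and latent topic discovery. The framework supports legal document clustering, summarization, and cross-referencing, providing scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.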
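As background for the latent topic discovery mentioned above, the sketch below runs plain (non-hierarchical) NMF with the classic multiplicative update rules on a tiny term-document style matrix. It is not the authors' pipeline: the toy matrix, rank, iteration count, and helper names are assumptions, and the hierarchical variant and the knowledge-graph construction are not shown. NMF approximates a non-negative matrix V as W H, where the rows of H can be read as topics and the rows of W as per-document topic loadings.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) into W (n x rank) and H (rank x m)
    using multiplicative updates for the Frobenius objective."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Toy matrix with two obvious "topics" (documents 0-1 vs. documents 2-3).
V = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, H = nmf(V, rank=2)
print(frob_error(V, W, H))
```

Because every factor stays non-negative, each document is expressed as an additive mix of topics, which is what makes the factors comparatively easy to interpret and to turn into graph edges or cluster labels.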

Abstract

arXiv:2505.02847v2 Announce Type: replace-cross Abstract: Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

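Summary

Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, giving a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it currently feels, and (iii) how it should reply, producing a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, which validates its psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models; it reveals substantial gaps of up to 4x between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps that conventional leaderboards such as Arena do not reflect. SAGE therefore offers a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.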
